Capturing Expression Using Linguistic Information
Abstract
Recognizing similarities between literary works for copyright infringement detection requires evaluating similarity in the expression of content. Copyright law protects expression of content; similarities in content alone are not enough to indicate infringement. Expression refers to the way people convey particular information; it captures both the information and the manner of its presentation. In this paper, we present a novel set of linguistically informed features that provide a computational definition of expression and that enable accurate recognition of individual titles and their paraphrases more than 80% of the time. In comparison, baseline features such as tf-idf-weighted keywords and function words give an accuracy of at most 53%. Our computational definition of expression uses linguistic features that are extracted from POS-tagged text using context-free grammars, without incurring the computational cost of full parsers. The results indicate that informative linguistic features do not have to be prohibitively expensive to extract.
Introduction
Copyrights protect an author's expression of content [1]; in order to constitute potential infringement, two works need to present similar content and use a similar manner of expression. For literary works, content refers to the story or the information, and expression refers to the linguistic choices of authors in presenting this content, such as authors' choices of particular vocabulary items from a set of synonyms (e.g., "clever" vs. "smart" in the sentences in (1)), whether they tend toward passive or active voice (e.g., the sentences in (2)), or whether they prefer complex sentences with embedded clauses to simple sentences with independent clauses (e.g., the sentences in (3)), as well as combinations of such choices.
(1) (a) Jill is very clever.
    (b) Jill is very smart.
(2) (a) The pirates sank the boat.
    (b) The boat was sunk by the pirates.
(3) (a) The woman carrying the umbrella walked in the rain.
    (b) The woman walked in the rain. She was carrying an umbrella.
[1] United States Code, Title 17, Chapter 1, §102.
Copyright © 2005, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
Expression focuses on the linguistic choices of the authors and does not include layout or generic genre characteristics of documents, because neither layout (such as use of titles, tables, and figures) nor genre characteristics (e.g., all poems consist of stanzas) represent linguistic choices of the authors.
In this paper, we set out to create a computational definition of expression that can help evaluate similarities between literary works for copyright infringement detection. In particular, we study syntax and semantics to identify a novel set of linguistic elements that capture expression, and that provide a computational definition of expression, in the genre of narrative fiction. Given a computational definition of expression, our goal is to generate fingerprints that help differentiate between two independently copyrighted works on the same content but also help recognize infringing copies of a work even when the infringement is not verbatim (i.e., paraphrases).
The ideal data set for this study would use examples of real-life infringement. Unfortunately, such a data set is not readily available. However, we have access to a corpus of parallel translations of titles; in this context, a title is an original work. Parallel translations, while not necessarily infringing, are derived from the same original title. During the translation process, translators add their own expression to the work and convey the same content in different ways, providing us with different books derived from the same title; we make this distinction between books and titles throughout this paper and rely on it in our experiments.
Books derived from the same title can be treated as paraphrases of each other (and of the original title) and, in the absence of real-life infringement data, serve as our surrogate. Using this surrogate data, we build models with a novel set of linguistic features and compare the performance of these features with baselines. The success of linguistic features in recognizing titles indicates that, despite differences in the way people phrase the same content, the essence of a literary work requires certain syntactic constructs to be present (either because of the content, or because people who derive content from the same original preserve some aspects of the original).
We believe that our findings on this surrogate data will generalize to real-life infringement cases: during infringement, despite efforts to paraphrase works, people use some similar constructs, either to adequately convey content or because they are unwilling to rewrite the whole work. Most infringers will make simple modifications to a work but are unwilling to put significant effort into re-creating it; if effort were not an issue, they would most likely create their own original rather than copy someone else's work.
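
To make the idea of extracting expression features from POS-tagged text without a full parser more concrete, the following is a minimal sketch of one such feature: the fraction of passive sentences, as in example (2). It is an illustration only and not the grammar used in the paper. The Penn Treebank-style tags are assigned by hand here, and the pattern (a form of "be", optional adverbs, then a past participle tagged VBN) as well as the names `is_passive` and `passive_ratio` are assumptions made for this example.

```python
from typing import List, Tuple

# A sentence is a list of (token, POS tag) pairs. Tags follow the
# Penn Treebank convention; this choice is an assumption of the sketch.
TaggedSentence = List[Tuple[str, str]]

BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}


def is_passive(sentence: TaggedSentence) -> bool:
    """Shallow passive check: a form of 'be', then optional adverbs (RB*),
    then a past participle (VBN). A heuristic pattern, not a parser."""
    seen_be = False
    for token, tag in sentence:
        if token.lower() in BE_FORMS:
            seen_be = True           # start (or restart) the pattern
        elif seen_be and tag == "VBN":
            return True              # 'be' ... VBN completes the pattern
        elif seen_be and tag.startswith("RB"):
            continue                 # adverbs may sit between 'be' and VBN
        else:
            seen_be = False          # any other token breaks the pattern
    return False


def passive_ratio(sentences: List[TaggedSentence]) -> float:
    """Fraction of sentences matching the passive pattern (0.0 if empty)."""
    if not sentences:
        return 0.0
    return sum(is_passive(s) for s in sentences) / len(sentences)


# Example (2) from the text, tagged by hand for illustration.
ACTIVE = [("The", "DT"), ("pirates", "NNS"), ("sank", "VBD"),
          ("the", "DT"), ("boat", "NN"), (".", ".")]
PASSIVE = [("The", "DT"), ("boat", "NN"), ("was", "VBD"), ("sunk", "VBN"),
           ("by", "IN"), ("the", "DT"), ("pirates", "NNS"), (".", ".")]

if __name__ == "__main__":
    print(is_passive(ACTIVE), is_passive(PASSIVE))
    print(passive_ratio([ACTIVE, PASSIVE]))
```

A ratio like this could serve as one entry in a per-book feature vector. The pattern is deliberately shallow: it will miss some passives (for example "got sunk") and can misfire on adjectival participles, which is the usual trade-off for avoiding a full parse.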